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Abstract 

We present a greedy method for simultaneously performing local bandwidth se- 
lection and variable selection in nonparametric regression. The method starts 
with a local linear estimator with large bandwidths, and incrementally decreases 
the bandwidth of variables for which the gradient of the estimator with respect 
to bandwidth is large. The method — called rodeo (regularization of derivative ex- 
pectation operator) — conducts a sequence of hypothesis tests to threshold deriva- 
tives, and is easy to implement. Under certain assumptions on the regression 
function and sampling density, it is shown that the rodeo applied to local linear 
smoothing avoids the curse of dimensionality, achieving near optimal minimax 
rates of convergence in the number of relevant variables, as if these variables were 
isolated in advance. 



Keywords: Nonparametric regression, sparsity, local linear smoothing, bandwidth estima- 
tion, variable selection, minimax rates of convergence. 



Research supported by NSF grants CCR-0122481, IIS-0312814, and IIS-0427206. 

2 Research supported by NIH grants R01-CA54852-07 and MH57881 and NSF grant DMS-0104016. 



1 



I. Introduction 



Estimating a high dimensional regression function is notoriously difficult due to the curse of 
dimensionality. Minimax theory precisely characterizes the curse. Let 



Yt = m(Xi) + e, 



n 



'1.1 s . 



where X$ = (JQ(1), . . . , X^d)) G M d is a <i-dimensional covariate, m : M. d — > R is the unknown 
function to estimate, and e, ~ N(0,a 2 ). Then if m is in W^c), the (i-dimensional Sobolev 
ball of order two and radius c, it is well known that 



lim inf n 4 /( 4+d ) inf sup TZ(m n ,m) > 

n ~ ,oc ™« meW 2 (c) 
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is n 4 /( 4+d ) ? which is impractically slow if d is large. 

However, for some applications it is reasonable to expect that the true function only de- 
pends on a small number of the total covariates. Suppose that m satisfies such a sparseness 
condition, so that 

m(x) = m(xn) (1.3) 

where xr = (xj : j e R), R C {1, . . . , d} is a subset of the d covariates, of size r = \R\ < i 
We call {xj}jeR th e relevant variables. Note that if an oracle were to identify and isolate 
the relevant variables, the better minimax rate of n _4 ^ 4+r ^ could be achieved, and this 
would be the fastest rate possible. Thus, we are faced with the problem of variable selection 
in nonparametric regression. Our strategy is to seek a greedy method that incrementally 
searches through bandwidths in small steps. 

A large body of previous work has addressed this fundamental problem, which has led to a 
variety of methods to combat the curse of dimensionality. Many of these are based on very 



'ormm(x) = ■ m 



•3 \ x j)i 



clever, though often heuristic techniques. For additive models of the 
standard met hods like stepwise se lection, C p and AIC can be used (jHastie et al.ll200lh . For 



re las so adapted 



199l|) effectively 



spline models. IZhang et al.l (120051) use likelihood b asis pursuit, e ssentially t 
to the spline setting. CART (|Breiman et all Il984f ) and MARS (Fr iedmani 
perform variable selection as part of their function fitting. Support vector regression can be 
seen a s creating a s parse representation using basis pursuit in a reproducing kernel Hilbert 

i pds, including methods 



space (jGirosil Il997l ) . There is al so a large literature on Bavesian met 



for sparse Gaussian processes (|Tipping| . hoOll 



Smola and Bartlett 



2001 



Lawrence et al. 
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20031 1 ; see iGeoree and McCullochl ()1997f ) for a brief survey. More recently, 



Li et all (|2005|) 

use independence testing for variable selection and iBuhfmann and Yd (|2005h introduced a 
boosting approach. While these methods have met with varying degrees of empirical success, 
they can be challenging to implement and demanding computationally. Moreover, these 
methods are typically very difficult to analyze theoretically, and so come with limited formal 
guarantees. Indee d, the theoretical analysis of sparse parametric estimators such as the lasso 
(ITibshiranil . 119961 ) is challenging, and only recently has sig nificant progress been made 
this front in t he statistics and signal processing communities (IDonohol l2004t iFu and Knight 



200 



0; 



Trond . 12004 . 



2006; 



Fan and PeneL 12004 Fan and LiL 120011) . 



In this paper we present a new approach for sparse nonparametric function estimation that 
is both computationally simple and amenable to theoretical analysis. We call the general 
framework rodeo, for "regularization of derivative expectation operator." It is based on the 
idea that bandwidth and variable selection can be simultaneously performed by computing 
the infinitesimal change in a nonparametric estimator as a function of the smoothing param- 
eters, and then thresholding these derivatives to get a sparse estimate. As a simple version 
of this principle we use hard thresholding, effectively carrying out a sequence of hypothesis 
tests. A modified version that replaces testing with soft thresholding may be viewed as 
solving a sequence of lasso problems. The potential appeal of this approach is that it can 
be based on relatively simple and theoretically well understood nonparametric techniques 
such as local linear smoothing, leading to methods that are simple to implement and can be 
used in high dimensional problems. Moreover, we show that they can achieve near optimal 
minimax rates of convergence, and therefore circumvent the curse of dimensionality when 
the true function is indeed sparse. When applied in one dimension, our met hod yields a 
local bandwidth selector and is simil ar to t he estimators of lRuppertJ (|l997l ) and lLepski et al. 

and its multivariate extension in Kerkyacharian, 



f)l997l ). The method in lLepski et al. 



Lepski and Picard (2001) yield estimators that are more refined than our method in the sense 
that their estimator is spatially adaptive over large classes of function spaces. However, their 
method is not greedy: it involves searching over a large class of bandwidths. Our goal is to 
develop a greedy method that scales to high dimensions. 



Our method is relate d to the structural adapation method of iHristache et al.l (j200lh and 
Samarov et al.l 1)20051 ) . which is designed for multi- index models. The general multi-index 
model is 

Y = g (Tx) + e (1.4) 



where x G M. d and T is a linear orthonormal mapping from M, d onto W with r < d. Variable 
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selection corresponds to taking T to be a r by d matrix of O's and l's with each = 1 if 
Xj is the i th relevant variable. Nonparame tric variable sel e ction can also be regarded as a 
special case of the partially linear model in 



Samarov et al 



(2005), which takes 



Y = 6 1 x 1 + G(x 2 ) + e (1.5) 

where x = (x\,X2). Taking 9 to be zero yields the model in this paper. The advantage of 
structural adaptation is that it yields, under certain conditions, y/n estimates of the image of 
T in (jl.4|) and 9 in ()1.5|) . However, structural adaptation does not yield optimal bandwidths 
or optimal estimates of the regression function, although this is not the intended goal of the 
method. 

In the following section we outline the basic rodeo approach, which is actually a general 
strategy that can be applied to a wide range of nonparametric estimators. We then specialize 
in Section El to the case of local linear smoothing, since the asymptotic properties of this 
smoothing techniqu e are fa irly well understood. In particular, we build upon the analysis of 
Ruppert and Wandl ( 1994 ) for local linear regression; a notable difference is that we allow 



the dimension to increase with sample size, which requires a more detailed analysis of the 
asymptotics. In Section 0] we present some simple examples of the rodeo, before proceeding 
to an analysis of its properties in Section E| Our main theoretical result characterizes the 
asymptotic running time, selected bandwidths, and risk of the algorithm. Finally, in Section^] 
we present further examples and discuss several extensions of the basic version of the rodeo 
considered in the earlier sections. The proofs of the theoretical properties of the rodeo are 
given in Section 



II. Rodeo: The Main Idea 

The key idea in our approach is as follows. Fix a point x and let fhh(x) denote an estimator 
of m(x) based on a vector of smoothing parameters h = (hi, . . . , hd). If c is a scalar, then 
we write h = c to mean h = (c, . . . , c). 

Let M(h) = E,(rhh(x)) denote the mean of rhh(x). For now, assume that x = Xi is one of 
the observed data points and that rho(x) = Y{. In that case, m(x) = M(0) = E(l^). If 
P = (h(t) : < t < 1) is a smooth path through the set of smoothing parameters with 
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h(0) = and h(l) = 1 (or any other fixed, large bandwidth) then 



m(x) 



M (0) = M(l) + M (0) - M{1) 



(2.1a) 



_ /■ mm ds 

Jo ds 



(2.1b) 



M(l)- / (D(s),h(s))ds 



(2.1c) 



where 



D( *)=™ (k ) = («£,...,«£)* (, 2 ) 



is the gradient of M{h) and /i(s) = is the derivative of h{s) along the path. A biased, 
low variance estimator of m(x) is rhi(x). An unbiased estimator of D{h) is 



v(u\ _ ( dm h (x) dfh h (x)\ 

z{h) -{-^hT--^hr) 



The naive estimator 




is identically equal to rh (x) = which has poor risk since the variance of Z(h) is large 
for small h. However, our sparsity assumption on m suggests that there should be paths for 
which D(h) is also sparse. Along such a path, we replace Z(h) with an estimator D[h) that 
makes use of the sparsity assumption. Our estimate of m(x) is then 



To implement this idea we need to do two things: (i) we need to find a path for which the 
derivative is sparse and (ii) we need to take advantage of this sparseness when estimating D 
along that path. 

The key observation is that if Xj is irrelevant, then we expect that changing the bandwidth 
hj for that variable should cause only a small change in the estimator m/j(x). Conversely, 
if Xj is relevant, then we expect that changing the bandwidth hj for that variable should 
cause a large change in the estimator. Thus, Zj = drhh(x)/dhj should discriminate between 
relevant and irrelevant covariates. To simplify the procedure, we can replace the continuum of 
bandwidths with a discrete set where each hj G B = {ho, (3ho, (3 2 ho, . . .} for some < /3 < 1. 
Moreover, we can proceed in a greedy fashion by estimating D(h) sequentially with hj G B 
and setting Dj(h) = when hj < hj, where hj is the first h such that \Zj(h)\ < Xj(h) for 




(2.5) 
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Figure 1: Conceptual illustration: The bandwidths for the relevant variables (hi) are shrunk, 
while the bandwidths for the irrelevant variables (/12) are kept relatively large. 



some threshold Xj. This greedy version, coupled with the hard threshold estimator, yields 
m(x) = m^(x). A conceptual illustration of the idea is shown in Figure ^ 

To further elucidate the idea, consider now the one-dimensional case x G R, so that 

m(x)=M(l)- [ d ^^-dh = M{\)- [ D(h)dh. (2.6) 



jo dh j 

Suppose that rhh{x) = Y17=i ^ ^( x > h) is a linear estimator, where the weights £i(x, h) depend 
on a bandwidth h. 

In this case 

n 

Z(h)=Y J ^^,h) (2.7) 

i=i 

where the prime denotes differentiation with respect to h. Then we set 

m{x) = fhi(x) - I D{h)dh (2.8) 
Jo 

where D(h) is an estimator of D(h). Now, 

Z{h) ^ N{b{h),s 2 {h)) (2.9) 

where, for typical smoothers, b(h) « Ah and s 2 (h) w C/nh 3 for some constants A and C. 
Take the hard threshold estimator 

D{h) = Z{h)l(\Z{h)\ > \{h)) (2.10) 



where X(h) is chosen to be slightly larger than s(h). An alternative is the soft-threshold 
estimator 

D(h) = sign(Z(h))(\Z(h)\ - X(h)) + . (2.11) 

The greedy algorithm, coupled with the hard threshold estimator, yields a bandwidth se- 
lection procedure based on testing. This approach to bandwidth selection is very similar to 
that of 



Lepski et al.l ()1997l ). who take 



h = max{/i e 7i : (f>(h, rj) = for all rj < h} 



(2.12) 



where <j)(h,r)) is a test for whether m v improves on fhh- This more refined test leads to 
estimators that achieve good spatial adaptation over large function classes. Kerkyacharian, 
Lepski and Picard (20 01) extend the idea to multiple dimensions. Our approach is also 
similar to a method of IRuppertl (J1997) that uses a sequence of decreasing bandwidths and 



then estimates the optimal bandwidth by estimating the mean squared error as a function 
of bandwidth. Our greedy approach only tests whether an infinitesimal change in the band- 
width from its current setting leads to a significant change in the estimate, and is more easily 
extended to a practical method in higher dimensions. 



III. Rodeo Using Local Linear Regression 



Now we present the multivariate rodeo in detail. We use local linear smoothing as the basic 
method since it is known to have many good properties. Let x = {x(l), . . . ,x(d)) be some 
target point at which we want to estimate m. Let ttih{x) denote the local linear estimator 
of m(x) using bandwidth matrix H. Thus, 



m H (x) = e\ (XjW^X^Xj^Y = S X Y 



1 vTi 



where e\ — (1, 0, . . . , 0) J 



/ 1 {X.-xf \ 



(3.1) 



(3.2) 



\ 1 (X n - x) T ) 

W x is diagonal with element K H (Xi — x) and K H (u) = ^^^K^H^^u). The esti- 
mator mn can be written as 



(3.3) 



i=i 
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where 



G(u,x,h) = e^(X^W x X x y l ( 1 \K H {u-x) 

(u — x) 



(3.4) 



is called the effective kernel. One can regard local linear regression as a refinement of k ernel 



regres si on where the effect i ve ke rnel G adjusts for boundary bia s and design bias; see 
(|l992UrIastie and Loaded (|l993h and lRuppert and Wandl (|l994h . 



Fan 



We assume that the covariates are random with density f(x) and that x is interior to the 
support of /. We make the same assumptions as Ruppert and Wand (1994) in their analysis 
of the bias and variance of local linear regression. In particular: 

(i) The kernel K has compact support with zero odd moments and there exists z/2 = 
ViitC) 7^ such that 

J uu T K(u) du = v 2 [K)I (3.5) 
where I is the d x d identity matrix. 

(ii) The sampling density f(x) is continuously differentiable and strictly positive. 



In the version of the algorithm that follows, we take K to be a product kernel and H to be 
diagonal with elements h = (hi, . . . , hf) and we write rhh instead of fan. 



Our method is based on the statistic 

dfhhix) 



where 



Let 



and 



dh-i 



Gj(u, x, h) 



i=i 



J^GjiX^x, h)Y t 
=i 

dG(u, x, h) 



dhi 



Hj = fijQi) = E(Zj\X u X n ) = ^2 G A x h h)m(Xi) 



s] = s){h) = V(Zj\X x , ...,X n )=a 2 J2 Gj(X t , x, hf . 



i=l 



(3.6) 

(3.7) 
(3.8) 

(3.9) 



In Section 14. CI we explain how to estimate a; for now, assume that a is known. The hard 
thresholding version of the rodeo algorithm is described in Figure 121 We make use of a 
sequence c n satisfying dc n = fi(logn), where we write f(n) = Q(g(n)) if liminf r 



M 

9{n) 



> 0. 
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Rodeo: Hard thresholding version 



1. Select (3 n = n~ a / 10 ^ n for some < a < 1 and initial bandwidth 



h = (3.10) 
log log n 

for some constant c . Let c n be a sequence satisfying c n = 0(1). 

2. Initialize the bandwidths, and activate all covariates: 

(a) hj = h , j = l,2,...,d. 

(b) ^ = {l,2,...,d} 

3. While A is nonempty, do for each j G ^4: 

(a) Compute the estimated derivative expectation: Zj (equation l3.6|) and Sj (equation 

E2J). 

(b) Compute the threshold Xj = Sj^/2 log (nc n ). 

(c) If \Zj\ > Xj, then set hj <— /3hj\ otherwise remove j from A. 

4. Output bandwidths h* = (hi, . . . , h d ) and estimator m(x) = m h *(x). 



Figure 2: The hard thresholding version of the rodeo, which can be applied using the deriva- 
tives Zj of any nonparametric smoother. 



To derive an explicit expression for Zj, equivalently Gj, we use 

dh ~ A dh A [dA1} 

to get that 

dm h (x) 

Zj = (3.12a) 

= eJ(X T WX)- 1 X T ^-Y - eJ(X T WX)- 1 X T ^X(X T WX)- 1 X T WY 

ohj ohj 

(3.12b) 

x t 1 -rdW 

= eJ(X T WXr 1 X T ^-(Y - Xa) (3.12c) 

dhj 



where a = (X T WX) 1 X T WY is the coefficient vector for the local linear fit (and we have 
dropped the dependence on the local point x in the notation). 

Note that the factor \H\~ l = Ylt=i V^i m the kernel cancels in the expression for m, and 
therefore we can ignore it in our calculation of Zj. Assuming a product kernel we have 

(d d \ 

II /f((X y - x^/hj), . . . ,]jK((X nj - x 3 )lh 3 ) (3.13) 
3=1 j=l J 

and dW/dhj = WLj where 

L 3 = diag j ^log^A^-^)//^ ^ 01og/r((X BJ --s 3 Q/fe 3 -) \ (g _ M) 

and thus 

Z 3 = eJ(X T WXy 1 X T WL j (Y - Xa) = ej BL 3 (I - XB)Y = G j (x,h) T Y (3.15) 
where B = (X T WX)- 1 X T W. 

The calculation of Lj is typically straightforward. As two examples, with the Gaussian kernel 
K{u) = exp(— u 2 /2) we have 

L j = ^ dia S - ^) 2 , • • • » ( X nj - ^) 2 ) (3-16) 

and for the Epanechnikov kernel K(u) = (5 — re 2 ) < y/5) we have 

L ' = k dlag ( s-S-S/ftj - *'l £ ^ ■ ■ • • (3 ' 17a) 



IV. Examples 

In this section we illustrate the rodeo on some examples. We return to the examples later 
when we discuss estimating a, as well as a global (non-local) version of the rodeo. 

A. Two Relevant Variables 
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Figure 3: Rodeo run on synthetic data sets, showing average bandwidths over 200 runs 
(left), final bandwidths with standard errors (right), and bandwidths on a single run of 
the algorithm (center). In the top plots the regression function is m(x) = bx\x\ with 
d — 10, n — 500, and er = .5 and in the lower plots the regression function is m(x) = 
2(xi + l) 3 + 2 sin(10x2), d = 20, n = 750, and a — 1. The figures show that the bandwidths 
for the relevant variables x\ and X2 are shrunk, while the bandwidths for the irrelevant 
variables remain large. 
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Figure 4: Squared error of the estimator on the previous examples, m(x) = 5x\x\ (left) and 
m(x) = 2{x\ + l) 3 + 2sin(10x2) right. For each plot, the left six boxplots show the risk 
in different dimensions (d = 5, 10, 15, 20, 25, 30) when using a single bandwidth, chosen by 
leave-one-out cross validation. The right six boxplots show the squared error on the same 
data with bandwidths selected using the rodeo. 

In the first example, we take m(x) = bx\x\ with d = 10, a — .5 with Xi ~ Uniform(0, 1). 
The algorithm is applied to the local linear estimates around the test point x = (~, . . . , |), 
with (3 = 0.8. Figure El shows the bandwidths averaged over 200 runs of the rodeo, on data 
sets of size n = 750. The second example in Figure El shows the algorithm applied to the 
function m(x) = 2(x± + l) 3 + 2sin(10x2), in this case in d = 20 dimensions with a = 1. 

The plots demonstrate how the bandwidths hi and /12 of the relevant variables are shrunk, 
while the bandwidths of the relevant variables tend to remain large. 

B. A One- Dimensional Example 

The next figure illustrates the algorithm in one dimension. The underlying function in this 
caseism(a;) = (1/x) sin(15/a;), and n = 1500 data points are sampled as x ~ Uniform(0, 1) + 
|. The algorithm is run at two test points; the function is more rapidly varying near the test 
point x = 0.67 than near the test point x = 1.3, and the rodeo appropriately selects a smaller 
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o 



1 .2 1 .4 



Test point 



Figure 5: A one-dimensional example. The regression function is m{x) = (1/x) sin(15/z), 
and n = 1500 data points are sampled, x ~ Uniform(0, 1) + |. The left plot shows the 
local linear fit at two test points; the right plot shows the final log bandwidth, log^ h+, 
(equivalently, minus the number of steps) of the rodeo over 50 randomly generated data 
sets. 

bandwidth at x = 0.67. The right plot of Figure |3] displays boxplots for logarithm of the 
final bandwidth, in the base 1/(3 (equivalently, minus the number of steps in the algorithm), 
averaged over 50 randomly generated data sets. 

The figure illustrates how smaller bandwidths are selected where the function is more rapidly 
varying. However, we do not claim that the method is adaptive over large classes of function 
spaces. As discussed earlier, the technique is intentionally a greedy algorithm; adapting to 
unknown smoothness may require a more refined search over bandwidths that does not scale 
to large dimensions, and is out of the scope of the current paper. 



C. Estimating a 



The algorithm requires that we insert a n estimate a of a in ()3.9j) . An estimator for a can 
be obtained by generalizing a method of iRicd f)1984r ) . For i < £, let 



d 



il' 



\Xi — Xe 



(4.1) 
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Fix an integer J and let £ denote the set of pairs (i, £) corresponding the J smallest values 
of da. Now define 

2 -tfE^-^) 2 - ( 4 - 2 ) 



7 2J 

i,£eS 



Then, 



where 



E(a 2 ) = a 2 + bias (4.3) 
drrij (x) 



bias < D sup 



(4.4) 



with D given by 



Z) = max \\Xi — XA\. (4.5) 

uee 



There is a bias-variance tradeoff: large J makes a 2 positively biased, and small J makes a 2 
highly variable. Note, however, that the bias is mitigated by sparsity (small r). 

A more robust estimate may result from taking 



-fnuHlianj Y )) (4.6) 



where the constant comes from observing that if X t is close to JQ, then 

\Yi-Y e \ ~ \N(0,2a 2 )\ = y/2a\Z\, (4.7) 
where Z is a standard normal with E|Z| = y/2/ir. 

Now we redo the earlier examples, taking a as unknown. Figure |H1 shows the result of 
running the algorithm on the examples of Section 14. Al however now estimating the noise 
using estimate (|4.6jl . For the higher dimensional example, with d = 20, the noise variance is 
over-estimated, with the primary result that the irrelevant variables are more aggressively 
thresholded out; compare Figure EH to Figure El 

Although we do not pursue it in this paper, there is also the possibility of allowing a(x) to 
be a function of x and estimating it locally. 



D. Computational Cost 

When based on a local linear estimator, each step of the rodeo algorithm has the same 
computational cost as constructing a single local linear fit. This is dominated by the cost of 
constructing the matrix inverse (X T WX)~ 1 in equation (|3.15J) . Since the derivative needs 
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Rodeo Step 



Variable 




Figure 6: Rodeo run on the examples of Section 14. A[ but now estimating the noise using 
the estimate a discussed in Section I4.CI Top: a — .5, d — 10; bottom: a — 1, d — 20. In 
higher dimensions the noise is over-estimated (center plots), which results in the irrelevant 
variables being more aggressively eliminated; compare Figure 01 
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to be computed for every variable, the algorithm thus scales as 0(d 4 ) in the dimension d. 
Implemented in R, the 20 dimensional example in Figure El takes 4 hours, 4 minutes and 
40 seconds for 200 runs, or 73.4 seconds per run, when executed on a 1.5 GHz PowerPC 
Macintosh laptop. Although we focus on local linear regression, it should be noted that very 
similar results are obtained with kernel regression, which requires no matrix inversion. Using 
kernel regression, the same example requires 12 minutes and 33 seconds, or 3.7 seconds per 
run. 



V. Properties of the Rodeo 

We now give some results on the statistical properties of the hard thresholding version of 
the rodeo estimator. Formally, we use a triangular array approach so that m(x), f(x), d 
and r can all change as n changes, although we often suppress the dependence on n for 
notational clarity. We assume throughout that m has continuous third order derivatives in a 
neighborhood of x. For convenience of notation we assume that the covariates are numbered 
such that the relevant variables Xj correspond to 1 < j < r and the irrelevant variables Xj 
correspond to r + 1 < j < d. 

A key aspect of our analysis is that we allow the dimension d to increase with sample 
size n, and show that the algorithm achieves near optimal minimax rates of convergence if 
d = 0(logn/ log log n). This hinges on a careful analysis of the asymptotic bias and variance 
of the estimated derivative Zj, taking the increasing dimension into account. We conjecture 
that, without further assumptions, d cannot increase at a significantly faster rate, while 
obtaining near optimal rates of convergence. 

The results are stated below, with the complete proofs given in Sectional 

We write Y n = Op(a n ) to mean that Y n = Op(b n a n ) where b n is logarithmic in n. As noted 
earlier, we write a n = fi(6 n ) if liminf n y. > 0; similarly a n = fl(b n ) if a n = Q(b n c n ) where 
c n is logarithmic in n. 

; ' I denote the Hessian of mix), let h; denote the ? th bandwidth at step 
OJ 

t of the algorithm and denote the bandwidth matrix by = diag((/if' ) ) 2 , . . . , (h^) 2 ). 
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Similarly, let H® = diag((/if ; ) 2 , • • • > (hr>) 2 ). Define 



(t) 



dh, 



■E [m H( t) (x) - m(x) \X 1 ,...,X n }. 



(5.1) 



Our first result is an asymptotic expansion of the bias of the estimator. 



Theorem 5.1. Suppose that x is interior to the support of f and that K is a product 
kernel. Assume that 

d — O (log n/ log log n), r = 0(1), (5.2) 



Co 



for some c > and 



for some < a < 1. Define 



log log n 



■(*) 



iy 2 m jj (x)h® 



3 <r 



- tr ( ^ 2 ( V,- log /(x)) 2 hf j > r. 



where u 2 is defined in equation (13. 5j) . Suppose that 

d 2 f(x] 



max sup 

a, 6 ^ 



and 



Then, for T n < C\ logn, 



max sup 



dx a dxb 
d 3 f(x) 



dx a dxbdx c 



0(l/d 2 



0(l/d 3 ). 



max \u,f -Lf\=0 P [{hf) 2 /d 

l<s<T n J J V J 

l<j<d 



(5.3) 
(5.4) 

(5.5) 



(5.6) 



(5.7) 



(5- 



The next result characterizes the variance v*p = Var(Zj(£) \Xi, . . . , X n ) 



Theorem 5.2. Assume the conditions of Theorem \5.1\ Let 



s%t) 



C 



d x 



(5.9) 
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where hj is the j th bandwidth at step t, with C = o 2 J K 2 (u) duj f(x). Then, 



P I max 

l<s<T„ 
■l<j<d 



(*) 



> e 



(5.10) 



for all e > 0. 



Our main theoretical result characterizes the asymptotic running time, selected bandwidths, 
and risk of the algorithm. In order to get a practical algorithm, we need to make assumptions 
on the functions m and /. 



(Al) For each j > r, 



(A2) 



V i tog/(x) = O^V (5.11) 



liminf min \rrijj(x)\ > 0. (5-12) 

n 1<j'<7" 



Explanation of the Assumptions. To give the intuition behind these assumptions, recall from 
Theorem 15.11 that 

{Ajhj + Op(hj) j < r . 
Bjhj + op(hj) j > r 

where 

Aj = P2m jj {x), Bj = - tr(/f^ R )z/ 2 2 (V i log/(x)) 2 . (5.14) 

Moreover, fij w when the sampling density / is uniform or the data are on a regular 
grid. Consider assumption (Al). If / is uniform then this assumption is automatically 
satisfied since then fMj(s) ~ for j > r. More generally, fij is approximately proportional to 
(Vj logf(x)) 2 for j > r which implies that \fij\ ~ for irrelevant variables if / is sufficiently 
smooth in the variable Xj. Hence, assumption (Al) can be interpreted as requiring that / is 
sufficiently smooth in the irrelevant dimensions. If d is constant and / is smooth, then this 
assumption is implied by Theorem 15.11 However, we allow for the possibility that d grows 
with n. 

Now consider assumption (A2). Equation (j5.1H|) ensures that fij is proportional to hj\rrijj(x)\ 
for small hj. Since we take the initial bandwidth ho to be decreasingly slowly with n, (A2) 
implies that \fij(h)\ > chj\m,jj(x)\ for some constant c > 0, for sufficiently large n. 
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Theorem 5.3. Assume the conditions of Theorem \5.1\ and suppose that assumptions (Al ) 
and (A2) hold. In addition, suppose that A min = minj< r \mjj(x)\ = fi(l) and A max = 
maxj< r \rrijj(x)\ = 0(1). Then the rodeo outputs bandwidths h* that satisfy 

P (h* = h for all j >r) — ► 1 (5.15) 

and for every e > 0, 

P ( r2 - 1 /(4+r)-e < < n -l/(4+r)+e for aR ■ < ^ „ j _ (5.16) 



Corollary 5.4. Under the conditions of Theorem 15..'^ the risic of the rodeo estimator 
satisfies 

K{h*) = E J{m h *{x) -m{x)fdx = P ( n - 4/ ^ +r)+e ) (5.17) 

for every e > 0. 

The corollary follows easily from the previous results using the bias- variance decomposition. 
Details are given in Section [7| In case d = 0(1), then the result can be strengthened to 

K(h*) = P (n- 4 /* 4 ^) (5.18) 

so that the optimal rate of convergence is obtained up to logarithmic factors, as if the relevant 
variables were known in advance and isolated. 

Assumption (Al) imposes a condition on the sampling density of the irrelevant variables, 
requiring that it is sufficiently smooth in the irrelevant dimensions. Using Theorem 15. H we 
can modify the threshold used by the algorithm in order to remove this assumption. 

Theorem 5.5. Consider a modified rodeo algorithm where the threshold is changed to 



Xj = PnP 3t + s j ^2\og(nc n ) (5.19) 

at step t, where p n > satisfies 

T7f — ► oo, ^ — > as n — > oo. (5.20) 

Then the results of Theorem 15. d hold under assumption (A2), without assuming (Al). 
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VI. Extensions and Variations of the Rodeo 



The rodeo represents a general strategy for nonparametric estimation, based on the idea of 
regularizing or testing the derivatives of an estimator with respect to smoothing parame- 
ters. There are many ways in which this basic strategy can be realized. In this section we 
discuss several variants of the basic hard thresholding version of the rodeo, including a soft 
thresholding version, a global rather than local bandwidth selection procedure, the use of 
testing and generalized cross validation, and connections to least angle regression. Further 
numerical examples are also given to illustrate these ideas. 



A. Subtracting off a Linear Lasso 



Local linear regression is a nonparametric method that contains linear regression as a special 
case when h — > oo. If the true function is linear but only a subset of the variables are relevant, 
then the rodeo will fail to separate the relevant and irrelevant variables since relevance is 
defined in terms of departures from the limiting parametric model. Indeed, the results depend 
on the Hessian of m which is zero in the linear case. The rodeo may return a full linear fit 
with all variables. A simple modificatio n fixes this problem. First, do linear variable selection 
using, say, the lasso ((Tibshiranil . 11996). Then run the rodeo on the residuals from that fit, 
but using all of the variables. An example of this procedure is given below in Section f6.PI 



B. Other Estimators and Other Paths 
We have taken the estimate 

D 3 (h) = Z J (h)I(\Z J (h)\>X J ) (6.2) 

with the result that 

fh(x) = fhh (x) — / (D(s), h(s))ds = rhh*(x). (6.3) 
Jo 

There are many possible generalizations. First, we can replace D with the soft-thresholded 
estimate 

Djit) = sign^W) (\Zj(h)\ (6.4) 

where the index t denotes the t th step of the algorithm. Since hj is updated multiplicatively 
as hj <— (3hj, the differential dhj(t) is given by dhj(t) = (1 — f3)hj. Using the resulting 
estimate of D(t) and finite difference approximation for h(t) leads to the algorithm detailed 
in Figured 
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Rodeo: Soft thresholding version 

1. Select parameter < (3 < 1 and initial bandwidth ho, satisfying 1 < h < \og i ^ d n, for 
a fixed constant £. Let c n be a sequence satisfying c n = 0(1). 

2. Initialize the bandwidths, and activate all covariates: 



(c) Initialize step, t — 1. 
3. While „4 zs nonempty 

(a) Set dhj(t) — 0, j — 1, . . . ,d. 

(b) Do for each j e .4.: 

(1) Compute the estimated derivative expectation Zj and Sj. 

(2) Compute the threshold Xj = s.,- a/2 log (nc n ). 

(3) If \Zj\> Xj, set c%(£) = (1 - (3) hj and hj <- 
otherwise remove j from *4. 



(a) /ij = /io, j = 1,2, . . .,d. 

(b) ^ = {1,2,. ..,d} 



(4) Set -Dj(t) = siga(Zj(h)) (\Zj(h)\ - Xj) 
(c) Increment step, t <—t + l. 



4. Output bandwidths h* 



(hi, ... , h d ) and estimator 




(6.1) 



Figure 7: The soft thresholding version of the rodeo. 
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Figure 8: Comparison of hard and soft thresholding. Left: m(x) = 5x\x\, d = 10 and 
a = 0.5; right: m(x) = 2(xi + l) 3 + 2sin(10x2), d = 10 and a = 1. The hard and 
soft thresholding versions of the rodeo were compared on 100 randomly generated datasets, 
with a single random test point x chosen for each; (3 = 0.9. The plots show two views of 
the difference of losses, (mh ar d(x) — m(x)) 2 — (rh soit (x) — m(x)) 2 ; positive values indicate an 
advantage for soft thresholding. 

Figure |H1 shows a comparison of the hard and soft thresholding versions of the rodeo on the 
example function m{x) = 2(xi + l) 3 + 2sin(10x2) in d = 10 dimensions with a = 1; (3 was set 
to 0.9. For each of 100 randomly generated datasets, a random test point x ~ Uniform(0, l) d 
was generated, and the difference in losses was computed: 

(m haTd (x) - m(x)) 2 - (m soft (x) - m(x)f . (6.5) 

Thus, positive values indicate an advantage for soft thresholding, which is seen to be slightly 
more robust on this example. 

Another natural extension would be to consider more general paths than paths that are 
restricted to be parallel to the axes. We leave this direction to future work. 
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C. Global Rodeo 



We have focused on estimation of m locally at a point x. The idea can be extended to carry 
out global bandwidth and variable selection by averaging over multiple evaluation points 
xi,...,Xk- These could be points interest for estimation, could be randomly chosen, or 
could be taken to be identical to the observed XjS. 

Averaging the ZjS directly leads to a statistic whose mean for relevant variables is asymp- 
totically k~ 1 hjY^ l 'i = i m jj( x i)- Because of sign changes in rrijj(x), cancellations can occur 
resulting in a small value of the statistic for relevant variables. To eliminate the sign can- 
cellation, we square the statistic. Another way of deriving a global method would be to use 
the statistic sup z |Z*(x)|. 

Let X\ : . . . , Xk denote the evaluation points. Let 

n 

Z j (x i )=Y t YsG j (X 8 ,x i ). (6.6) 



s=l 



Then define the statistic 



i=i 

where Pj = GjGj, with Gj(s,i) = Gj(X s ,Xi). 

If j G R° then we have E(Zj(xj)) = o(l), so it follows that, conditionally, 

2 

Wi) = ^tr(P,)+o P (l) (6.8a) 

V(T,) = ^tr(P^) + o P (l). (6.8b) 

We take the threshold to be 

A, = °- tv(Pj) + 2^- v / tr(P,P,)log(c n n) . (6.9) 



Note that if j > r, we have 



but for j < r we have 



HT 3 ) = lY. s ^)+°( h o) (6-10) 

i 
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We give an example of this algorithm in the following section, leaving the detailed analysis 
of the asymptotics of this estimator to future work. 



D. Greedier Rodeo and LARS 



The rodeo is related to least angle regression (LARS) ([Efron et all 120041 ) . In forward stage- 
wise linear regression, one performs variable selection incrementally. LARS gives a refinement 
where at each step in the algorithm, one adds the covariate most correlated with the resid- 
uals of the current fit, in small, incremental steps. LARS takes steps of a particular size: 
the smallest step that makes the largest correlation equal to the next-largest correlation. 



Efron et al.l (|2004f ) show that the lasso can be obtained by a simple modification of LARS. 



The rodeo can be seen as a nonparametric version of forward stagewise regression. Note first 
that Zj is essentially the correlation between the Yis and the Gj(Xi, x, h)s (the change in the 
effective kernel). Reducing the bandwidth is like adding in more of that variable. Suppose 
now that we make the following modifications to the rodeo: (i) change the bandwidths one 
at a time, based on the largest Z* = Zj/Xj, (ii) reduce the bandwidth continuously, rather 
than in discrete steps, until the largest Zj is equal to the next largest. This version can then 
be thought of as a nonparametric formulation of LARS. 

In fact, we can go further and embed the rodeo within LARS to get a fast nonparametric 
method. We do this by replacing the derivatives of the fit in the rodeo with differences. 
Then we iterate variable selection with bandwidth selection. 



• Set h = (h , . . . , h ). Define <i-dimensional pseudo-covariates X iy i — 1, . . . , n, by 

X i (j) = G j (X i ,x,h), j = l,...,d. (6.12) 

Now run the LARS algorithm, regressing the Y^s on the pseudo-covariates, up to 
some pre-defined stopping point. This step essentially chooses relevant variables at the 
resolution of the starting bandwidth h = (h , . . . , h ). 

• Define new pseudo-covariates X i: i = 1, . . . , n, by 

Xi(j) = Gj(Xi, x, ti) - GjiXi, x,h), j = 1, . . . , d (6.13) 

where h' has hj replaced by (3hj. Note that adding the j th covariate corresponds to 
reducing the bandwidth from hj to f3hj. Now run the LARS algorithm up to some 
pre-defined stopping point. 
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Greedy Rodeo Step 



80 



100 



Figure 9: Greedy rodeo on the diabetes data, used to illustrate LARS fEf ron et all I2004). 
A set of k = 100 of the total n = 442 points were sampled (d = 10), and the bandwidth for 
the variable with largest average \Zj\/Xj was reduced in each step. 



• Repeat the last step until a stopping criterion is satisfied. 

One advantage of this method is that it can be implemented using existing LARS software. 
We leave the development of the theory for this approach to future work. Some examples of 
the greedy version of this algorithm follow. 



D.l Diabetes example 

Figure El shows the result o f running the greedy version of the rodeo on the diabetes dataset 



used by 



Efron et al.| (|2004h to illustrate LARS . The algorithm averages Z* over a randomly 
100 data points, and reduces the bandwidth for the variable with the 



chosen set of k 

largest value; note that no estimate of a is required. The resulting variable ordering is seen 
to be very similar to, but different from, the ordering obtained from the parametric LARS fit. 
The variables were selected in the order 3 (body mass index), 9 (serum), 7 (serum), 4 (blood 
pressure), 1 (age), 2 (sex), 8 (serum), 5 (serum), 10 (serum), 6 (serum). The LARS algorithm 
adds variables in the order 3, 9, 4, 7, 2, 10, 5, 8, 6, 1. One notable difference is in the position 
of the age variable. 
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Figure 10: Left: A typical run of the greedy algorithm on Turlach's example. The bandwidths 
are first reduced for variables £2,^3,^4,^5, and then the relevant, but uncorrelated with Y 
variable X\ is added to the model; the irrelevant variables enter the model last. Right: 
Histogram of the position at which variable x\ is selected, over 100 runs of the algorithm 



D.2 Turlach's example 



In the discussion to the LARS paper, Berwin Turlach ((Turlachl . |200J) gives an interesting 
example of where LARS and the lasso fails. The function is 



Y 



+ X 2 + X 3 + X A + X 5 + e 



(6.14) 



with ten variables X« ~ Uniform (0, 1) and cr = 0.05. Although X\ is a relevant variable, it 
is uncorrelated with Y, and LARS and the lasso miss it. 

Figure E3 shows the greedy algorithm on this example, where bandwidth corresponding to 
the largest average Z* is reduced in each step. We use kernel regression rather than local 
linear regression as the underlying estimator, without first subtracting off a Lasso fit. The 
variables x%, x%, x±, x§ are linear in the model, but are selected first in every run. Variable 
xx is selected fifth in 72 of the 1 00 runs; a typic al run of the algorithm is shown in the left 
plot. In contrast, as discussed in iTurlachl (|2004l ). LARS selects x\ in position 5 about 25% 
of the time. 
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Figure 11: The global rodeo averaged over 10 runs on Turlach's example. The left plot shows 
the bandwidths for the five relevant variables. Since the linear effects (variables two through 
five) have been subtracted off, bandwidths h%, h 3 , hi, h§ are not shrunk. The right plot shows 
the bandwidths for the other, irrelevant, variables. 

Figure ITT1 shows bandwidth traces for this example using the global algorithm described in 
Section l"6. CI with k = 20 evaluation points randomly subselected from the data, and a taken 
to be known. Before starting the rodeo, we subtract off a linear least squares fit, and run the 
rodeo on the residuals. The first plot shows hi, . . . , h*,. The lowest line is hi which shrinks 
the most since m is a nonlinear function of X\. The other curves are the linear effects. The 
right plot shows the traces for h§,..., hio, the bandwidths for the irrelevant variables. 



VII. Proofs of Technical Results 

In this section we give the proofs of the results stated in Section 
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A. Proof of Theorem \5.1\ 



We follow the setup of iRuppert and Wandl (J1994J) except for two differences: the irrele- 
vant variables have different leading terms in the expansions than relevant variables and 
d is increasing; this requ ires a more detailed asymptotic analysis than is carried out by 
Ruppert and Wandl ([l994h . 



Let H 



H R 




denote the Hessian of m(x), let hf' denote the j th bandwidth at 



step t and denote the bandwidth matrix by = diag((/i^ 



(t)\2 



' \'"d 



H® = diag((^ 



( ^ 2 ,...,(^) 2 



(K L, y 2 ). Define 



(*) 
Mi 



_d_ 

dh. 



■E [m H( t) (x) - m(x) \X 1 ,...,X d ] 



Similarly, let 
(7.1) 



In the following we sometimes supress the superscript t. 
Let Vm be the gradient of m at x, and let 

Q = ((Xi - x) T H(X 1 -x),...,(X n - x) T H(X n - x)) T . 
Note that Vm and Q are only functions of the relevant variables. Then 

m{Xi) = m(x) + {Xi - xfVm + ^Qi + R, 
where, using multi-index notation, 

Ri = l^ Xi ~ x)a J D * m ((1 ~ s)x + sXi) ds = \^ Xl ~ x ) aD ° m & 



(7.2) 



(7.3) 



(7.4) 



|a|=3 



|q|=3 



where & )Q , is some point between Xi and x. 
So, withM= (m(X 1 ),...,m(X n )) T , 



M = X-r 



m[x ) 
Vm 



-Q + R 



(7.5) 



where R = . . . , R n ) T ■ Since S x X x (m(x), Vm) T = m(x), the bias b(x) = E(m#(a;)) 
m(x) is given by 



6(rr) = e\S x M — m(x) = —S X Q + S X R 



-e{ {Xi V^Q + (Xj W^X,)" 1 ^ 



(7.6a) 
(7.6b) 
(7.6c) 
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where T n = n~ x [X T x W x X x ) and T n = n~ 1 {X^W x Q). 
Analysis o/T n . We write 



T 



IV" w 



^11 ^12 

^21 ^22 



where 



(7.7a) 



(7.1 



Now, the variance of Wi can be bounded as 

d - Yd 



v »w) * » = n n ^ (V v ) * n £ 



(7.9) 



where = sup u i^(w). Therefore, 



Var(A u ) < ^TT^<-^7 = — • 



(7.10) 



Also, < A s . Note that (3 T " d nh d /C% = Q (y-<*-i°gi°gi°gn/i°gi°gn) ; where a ig the constant 
appearing in ([5.4)1 . Hence, by Bernstein's inequality, for any 5 > 0, 



P(|^n -E(An)| > e) < 2exp 



H( 



ne 2 \ 
A + Ae/3 J 



f 1 / ne 2 /?^ 
f 1 / ne 2 (5 dTn hi 



1 



< 2rxp|--n^e 2 



(7.11a) 
(7.11b) 
(7.11c) 
(7.11d) 



Therefore, uniformly for all t and j, Af} = E(A^) + P (l/n^^- 8 ) for any 5 > 0. Let D 
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denote the Hessian of /. For some v between x and v, 

1 



11, 



K(H~ 1/2 (x - u))f(u) du 



(7.12a) 
(7.12b) 



hih 2 ■ ■ -h d 

K(v)f(x - H 1/2 v) dv 

f(x)- J K(v)(H 1 / 2 v) T Vf(x)dv + ^ J K(v)(H 1 / 2 v) T D 2 (v)H 1 / 2 vdv 

(7.12c) 

f(x) + ^ j K(v)(H 1/2 v) T D 2 (v)H 1/2 vdv (7.12d) 
f(x)+n (7.12e) 



where from ()5.6|) 



n 



- f K(v)(H 1/2 v) T D 2 (v)H 1/2 vdv = O ( d h 2 max sup 

2 J \ a,b x 



d 2 f{x) 



dx a dx b 



Therefore we have, uniformly over 1 < t < T n , 

A u = f{x) + 0({hf) 2 /d) 



O {{hf ) 2 /d) 

(7.13) 

(7.14) 



Next consider A 2 i. By a similar argument as above, uniformly for all t and j, A® 
E(4?) + P (l/n^- s ) for any 5 > 0. Also, 

1 



E(A 



21; 



K(H~ 1/2 (x - u)){u - x)f{u) du 



h l h 2 ••■h d 
K(v)H 1 l 2 vf(x + H l l 2 v)dv 



(7.15a) 
(7.15b) 



f{x) J K{v)H l ' 2 vdv + J K(v)H 1/2 v(H l/2 vfD dv 







+~ J K{v)H l t 2 v{ k H 1 / 2 v) T D 2 H 1 f 2 vdv 

V v ' 

o 

h a h b h c v a v b v c D 3 (a, b, c) dv 



(7.15c) 



where D 2 and D3 are the arrays of second and third derivatives of /, the latter being evaluated 
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at some point between x and u. So, 

E(A 21 ) = [ K{v)H 1/2 v{H l/2 v) T Ddv + ^ J K(v)H 1/2 vJ2 h a h bh c v a v b v c D 3 (a,b,c)dv 

(7.16a) 
(7.16b) 



abc 



v 2 {K)HD + r 2 



where 



r 2 = ~ J K(v)H 1/2 v h a h b h c v a v b v c D 3 (a, b, c) dv 

"' abc 



O [ d 2 max sup 



d 3 f( 



x 



dx a dx b dx c 



o(((hf>Y/d)i 



(7.17a) 
(7.17b) 



using (|5.7j) and the definition of v 2 in equation (|3.5|) . Therefore, A 21 = v 2 {K)HD + 

o P ({{hf y/d)i). 

Now we turn to A 22 . Again we have convergence to its mean and 



E(A 
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J K(v)H 1/2 v(H 1/2 v) T f(x + H 1/2 v) dv (7.18a) 
f(x) J K(v)H l l 2 v(H l ' 2 v) T dv+ l - J K(v)H 1/2 v(H 1/2 v) T (H 1/2 v) T D dv 
+~ J K^H^v^vfiH^vfD^H^v) (7.18b) 



= u 2 f(x)H + R 3 
where R 3 is a matrix whose off-diagonal elements are of order 



O [ h max sup 

a, b -y. 



d 2 f(x) 



dx a dx b 

and whose diagonal elements are of order 

d 2 f(x) 



oUhff/d 2 



(7.18c) 



(7.19) 



O I <i/i maxsup 

a, b ™ 



dx a dx h 



O ({hff/d) 



Thus, 



We can now write 



T 

J- f 



f(x) + Tx u 2 D T H + rj 
u 2 H T D + r 2 v 2 f(x)H + R 3 

T n = A + vv T + E 



(7.20) 



(7.21) 



(7.22) 



31 



where 

V2f(x)H 



A=\ f{x) ~ l ° I, (7.23) 



v T = (1, u 2 D T H) (7.24) 

and 

To invert the matrix we use the following result. Let B be an invertible matrix and let A 
be any matrix such that |A| < e\B\ (interpreted elementwise) . Then, 

\(B + A)" 1 - B- 1 ] < e|fl _1 | \B\ IB' 1 ] + 0(e 2 ) (7.26) 

where the absolute values and inequalities are elementwise. Thus, 

IT; 1 - (A + vv T )- l \ <e\(A + vv T Y l \ \A + vv T \ \{A + vv T )~ l \ + 0(e 2 ) (7.27) 

where e = max^ \E t j\. It now follows from ()7.13|h ()7.17b|) . ()7.19|) . ()7.20|) and a union bound 
that 

max \r-\j,k)- (A + vv T )-\j, k)\ =o P (h*). (7.28) 

To compute (A + v v T )~ 1 we use the matrix inversion lemma (Woodbury formula) 

(A + vv T )- 1 = A' 1 - + v T A~ 1 v)~ 1 v T A~ 1 . (7.29) 

Now, 

A' 1 = L \ , (7.30) 



J 7rV' 

"2 fix), 



and 



t - 1 f(x) h , oD T HD f(x) , „, „o . N 

l + vAv = JW^i + = TW 1 ^ 1 + maxZ, 2» P.3i) 

so that 

(l+^A-^)- 1 = Z!£-lI(i + (i)). ( 7 - 32 ) 

Also, 

1 D T 



A-^A- 1 = ( (/W- 1 ) 2 ] . (7.33) 

32 



Finally, we have that 

1 D T 



r- n 1 = (A + vv T )- 1 + o P (l)=[ Ri Z® U(x) _ 1) )+o P {hl). (7.34) 



P(x) u 2 f(x) p(x) 



This completes the analysis of T n . 
Analysis ofT n = ^XjW x Q. We can write 



1 o = ( l Er=i - *rnx* -x) \ = / 7i 

1 •'•)1>'H.V, - x) T H(X t -x) I 72 



Now, 

E( 7 i) = y K{v){H r ' 2 v) T U{H^ 2 v)f{x + H^ 2 v)dv (7.36a) 

= /(x) J K(v)(H 1/2 v) T H(H 1/2 v)dv + J K(v)(H 1/2 v) T H(H 1/2 v)D T (H 1/2 v)dv 

+ i y Kiv^H^vfHiH^v^H^vfD^H^dv (7.36b) 

= i^/(x)tr(fZ"Wfl) + 0(dr/^max£) 2 (a,6)) (7.36c) 

ab 

= u 2 f(x)tr(HH R ) + 0((hf ) ) 4 /d). (7.36d) 

The stochastic analysis of 71 from its mean is similar to the analysis of A 12 and we have 
71 = is 2 f(x) tr(tfWfl) + o P ((/if) 4 /^) • 

Next, 

E( 72 ) = J K(v)(H 1/2 v)(H 1/2 vfH(H 1/2 v)f(x + H 1/2 v)dv (7.37a) 

= /(x) J(H 1/2 v)K(v)(H 1/2 vfH(H 1/2 v)dv 

+ y K{v){H 1 / 2 v){H 1 ' 2 v) T n{H 1 / 2 v)D T {H 1 / 2 v)dv 

+ l - J K{v){H l l 2 v){H l l 2 v) T H{H l l 2 v){H^ 2 v) T Dl{H l l 2 v)dv (7.37b) 

= y K(v)(H 1/2 v)(H 1/2 v) T H(H 1/2 v)D T (H 1/2 v)dv 

+ Kiv^H^v^H^vfHiH^v^H^vfD^H^dv (7.37c) 

= A + 0(drhl maxD 2 (a, 6) 1) (7.37d) 

= A + O((h 5 /d)l) (7.37e) 
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where A is a d- vector with j th component 

r r 

Thus, 



k=l 



k=l 



v 2 f(x)tr(HH R ) 
A 



+ P {(hl/d)l). 



(7.38) 



(7.39) 



Analysis of remainder ejT n l ^X^W x R. We can write 



1 EIU 

7 l T:U(X l -x)W l R t 



-XlW x R 



n 



Si 
S 2 



(7.40) 



Let m 3 denote the array of third derivatives of m and let v 3 - = hjVj. Then, 
E(<Ji) = J K(v)m 3 (g) ^ v a v b v c f(x + v)dv 

l<a,b,c<r 

= f(x) J K(v)m 3 (0 

V a VbV c dv + J K(v)m 3 (0 Yl 



(7.41a) 



l<a,b,c<r 



l<a,b,c<r 



77l 3 (x) / K(v) Y V " 



v b v c (v 1 D)dv + o[(h { ;>) 



VbV c (v D) dv 

(7.41b) 
(7.41c) 



l<a,b,c<r 



and so 



Similarly, 



\E(5 1 )\ = oLup\m 3 (x)\ sup\D(x)\ £ h)h\ j = O ( £ h)hl 

V j,k<r / \j,k<r j 



E(5 2 ) = 



l<a,b,c<r 

f(x) J K(v)m 3 (£)v ^2 v a v b v c dv + o ({h 



(*)\4 
3 > 



l<a,b,c<r 



O sup |m 3 (x)| ^2 h]h 

j,k<r 



Hence, 



e^T-y-X^W x R = P (j2hy k ) 

\j,k<r / 



(7.42) 

(7.43a) 
(7.43b) 

(7.43c) 
(7.44) 



34 



Putting all of this together, we conclude that the leading term of b(x) is 

h^^-lm (7 - 45) 

where A is given in (|7.38jl . Taking the derivative with respect to bandwidth hj gives the 
statement of the theorem. 



Remark 7.1. Special treatment is needed if a; is a boundary point; see Theorem 2.2 of 
Ruppert and Wand (1994). 



B. Proof of Theorem \5. 6 Ji 

Let t denote the first row of S x . Then, with £ ~ N(0, a 2 ), 

fn H (x) = ^2t i Y i = ^2eMX i )+^2e i e i (7.46a) 



i V i 

£ ^ w + 7 h A i t ( 7 - 46c ) 

^ ^Jnh x ■■■h d 



where 



Thus, 



A = Lit ■■■h i £ Pi- (7.47) 
VMZ mXl , ...,*„) = „=Var (i- j7 ^=^) ■ (7-48) 



Now we find an asymptotic approximation for A. 
Recall that 

1 ,^r„. „ V 1 1 



S x = [ -X T X W X X X -X T X W X (7.49) 
n In 



and from our previous calculations 

-1 / 1 D< 



^n 1 = ( l -X T x W x X, x ) g -4 ^ mx) - 1) )+Op((hf ) ) 2 /d). (7.50) 

V J V P{x) u 2 f(x) p(x) I 
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Note that £\ t 2 is the (1,1) entry of S X S?. But 

S X S? = {T-'-XlW x ){T-'-XlW x ) T (7.51a) 
n n 

= ^T^X^W^T' 1 (7.51b) 

1 iXiiX, - x)Wf - x)(X t - xfW 2 1 ^ ^ ' 



n 



So A 2 is the (1, 1) entry of 

r -i( h -^E,w 2 ^E«-^m 2 

= T- 1 ^ 11 ^W 1 (7.52b) 

Next, as in our earlier analysis 

a n = J K 2 {v) f{x - H 1/2 v) dv (7.53a) 

= f(x) j ' K 2 {v)dv + P {h 2 Jd). (7.53b) 

Similarly, 

a 2 i = v%HD + Op ({{hf ) 4 /d)l) (7.54) 
where z/2/ = / vv T K 2 (v) dv and 

« 22 = f(x)9 2 H + E (7.55) 
where maxjj Eij = Op f(/t^) 4 Y Hence, the leading order expansion of A 2 is given by 

fK 2 (v)dv D T HD . [K 2 (v)dv ,„ r „, 

Taking the derivative with respect to hj we thus conclude that 

Vax( W ,...,X„) = ^f^'.^ d^d)). (7.57) 

C. Proof of Theorem \5JA 

This result characterizes the asymptotic running time, selected bandwidths, and risk of the 
algorithm. We restate the assumptions made on the functions m and / in Section 
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(Al) For each j > r, 

V 3 \ogf{x) = O f ' 



1/4 



n 



(A2) 



liminf min \rrijAx)\ > 0. 



To prove the theorem we make use of a version of Mill's inequality, modified for 
mean random variables. 

Lemma 7.2. Let Z ~ N((i, a 2 ). If A > 2/i and A 2 > 2a 2 then 

™ / 1 „ i * \ 5 A j A 2 
P(|Z| > A) < — exp 



a " 1 8a 2 

Moreover, if A > 5er then 

A 2 

F(\Z\ > A) < exp ' 



16a 2 

Proof. Without loss of generality, assume /i > 0. Then, 



P(|Z| > A) < 2P(Z > A) = 2P > 

^ 2a / (A - /i) 2 1 ' 



Now = -B(O) + fiB'(fi) for some < u, < u, and 



A — /i [ 2cr 2 J \ a 2 X — /l 
Hence, 



A — /i [ 2(T 2 J \0" 2 A — pl / 

When A > 2/i, 1/(A - //) < 2/A and (A - a) 2 > A 2 /4 so that if A 2 > 2a 2 then 



Thus, 



4a f A 2 W A 2\ 8 f A 2 I 
^(/i) < yexpj-— | + -J <-exp|- — j. 

Iryl ^ 2a ( X 2 ) 8u, f A 2 1 5A f A 2 1 
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The last statement follows since 5xe x2//8 < e x2 / l& for all x > 5. □ 

Proof oi Theorem 15.31 First consider j > r. Let Vt = {j > r : hj = /io/3*} be the set 
of irrelevant dimensions that are active at stage t > 1 of the algorithm. From Assumption 
(Al) it follows that, for sufficiently large n, 

\ j >\Lf\ = 0(hln~ 1/2 ) (7.67) 

for j > r. Also, Xj > 5sj . From Theorems 5.1 and 5.2, we have that, with probability 
tending to one uniformly over j > r, Xj > 2\/j,^\ and X 2 > 2v^\ and /(s^) 2 = 1 + o(l). 
Thus, by Lemma I7~2l 

P(|2^| > Xj, for some j E V t ) < P(|^| > Xj) + o(l) (7.68a) 

< ciexp(-Aj/(16^)) +o(l) (7.68b) 
= ciexp (— A|(l + o(l))/(16s|)) + o(l) -> 0. (7.68c) 

Therefore, with probability tending to one, foj = /iq f° r each j > r, meaning that the 
bandwidth for each irrelevant dimension is frozen in the first step in the algorithm. 

Now consider j <r. By (|5.12j) . and Theorem 15.11 for all large n, 

N > \L?\ - \n - Lf\ = \Lf\ - P {{hfY/d) > chf\mjj{x)\ (7.69) 

for some c > 0. Without loss of generality, assume that chjrrijj(x) > 0. We claim that in 
iteration t of the algorithm, if 

* < ^-Jo^( C ^b^)=to (7.70) 



4 + r " b W \8C\og(nc v 
then 

F(hj = ho/?, for all j < r) — > 1. (7.71) 
To show this, first note that (|7.7U|) can be written as 

i\*(4+r) 2 A2 h d+A 

(3 ) ~ 8Clog(c n n) • { ■ } 

Except on an event of vanishing probability, we have shown above that 



hi h (d ' r) ' 

j>r ■> n 
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So on the complement of this event, if each relevant dimension is active at step s < t, we 
have 

\) 2s 2 j \og{nc n ) _ 2C\og(nc n ) yj 1 



Yl- (7.74a) 



h] " h* ////; 11 h, 

J J J % 

2Clog(nc w ) /iy 4+ -^c^ nm 
< ° 2m f x)2 (7.74c) 



which implies that 
and hence 



crrijjixjhj > 2\j (7.75) 

cm jj( x )hj - Aj > A, = v ^y^-J (7 . 76) 
for each j < r. Now, 

P(rodeo halts) = P(|Z,-| < A,, for all j < r) < P(|Z,-| < A, for some j < r) (7.77a) 

< Y.^\ z i\ <x i)^Y.^ z i <x i) ( 7 - 77b ) 

< ^pf^Lli^^V) (7.77c) 

< J2 F ( Zj ~ — > Cm "^ hj - Xj ) (7.77d) 

< , r (7.77e) 

nc„ a/2 log(nc„) 



Finally, summing over all iterations s < t gives 



r I r 2 n A 2 h d+A 



logi 



< r + 4--/^8Cl g(^) / >Q (?78b) 
nc n a/2 log (nc n ) 

Thus, the bandwidths hj for j < r satisfy, with high probability, 

hj = h ft < h (3 t0 (7.79a) 

= H^-^rJ (7 - 79b) 



V C ^min^O / 
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Now, since dlogd = O(logn) it follows that d = O (log nj log log n) and 

i = 0((loglogn) rf ) (7.80a) 

(log log log n \ 

n lo ^°sn j (7.80b) 

= 0(n e ) (7.80c) 

for any e > 0. In particular, with probability approaching one, the algorithm runs for a 
number of iterations T n bounded from below by 

' ,1-e. 



T n > lr —\og x/0 {n 1 - e a n ) (7.81) 
j_ r 

c 2 AL, 



4 + r 

where 

fi(l) . (7.82) 



mm 



8Clog(nc n ) 

We next show that the algorithm is unlikely to reach iteration s, if 

8 > j^log^ (n 1+5 b n ) (7.83) 

for any 5 > 0, where 6 n = 0(1) is defined below. From the argument above, we know that 
except on an event of vanishing probability, each relevant dimension j < r has bandwidth 
no larger than 

hj < h Q f3^ logl l^ nl ~ t(X ^ (7.84a) 
h 



(n 1 " e a n ) 1 /(4+r)- 



(7.84b) 



Thus, if relevant dimension j has bandwidth hj < ho/3 s , then from Theorem 15.21 we have 
that 



2 



S ; 



C n r ^/^a r n /(i+r) 1 



A)h) ~ A]nh%^ K h d - r (T ' 85a) 



C a r n /{4+r) 1 
A)n A /^+ r )+ 6 hl +d /3 4s 

C a r n /{4+r) : 



(7.85b) 

c a ; /(4+r) 1 

— A 2 „A/(4+r)+S i„4+d /34s (7.85c) 



where 6 = re/ (4 + r). Therefore, 



s 2 

' > log log n (7.86) 
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m case 



±Y > (n l *\) '/("+'> (7.87a) 
= n^'' ( ^-'°g'°f " V /4 (7.87b) 



r ,^r/(4+r) 
\ l_/ tin / 

> n^(^ ( y n ) 1/4 (7.87c) 

which defines 6 n = 0(1). It follows that in iteration s > ^log^ (n 1+<5 6 n ), the probability 
of a relevant variable having estimated derivative Zj above threshold is bounded by 



P(|Zj| > Xj, for some j < r) < P ( ^ > ^ J 



(7.88a) 



< — +iE^f ( 7 - 88c ) 

nc n a/2 log (nc n ) 4 ^ s^- 

< ; r + (7.88d) 

nc n a/2 log (nc n ) 4 log log n 

= O (- — I J (7.88e) 



log log n 



which gives the statement of the theorem. □ 



Proof of Corollary 15.41 Given the bandwidths in (J5.15|) and (|5.16j) we have that the 
squared (conditional) bias is given by 

Bias 2 (m,0 = (Y,3<rAjhfy + op(ti{H*)) (7.89a) 

= ^ AiAjhfhf + o P {ti{H*)) (7.89b) 

i,j<r 

= P (n- 4/(4+r)+e ) (7.89c) 
by Theorem 15.31 Similarly, from the proof of Theorem 15.21 the (conditional) variance is 

Var(rV) = i fr[ + o P (l)) (7.90a) 

= P {n- 1+r l {r+ ^ +e ) (7.90b) 

= P (n- 4 /( 4+r > +e ) (7.90c) 
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where R{K) = j K(u) 2 du. The result follows from the bias-variance decomposition. □ 



D. Proof of Theorem \5.5\ 

This result shows that assumption (Al) can be dropped if we modify the algorithm by simply 
increasing the threshold slightly. Examples of choices for p n are p n = 5h^ or p n = h 3 Q - s for 
some small 5 > 0. 

Let j > r. Without loss of generality assume that fij > 0. Recall that, if variable j is still 
active in round t, fij = ch 3 + Op((h^) 2 /d) for some constant c > 0. It follows from (|5.20j) 
that, for all large n, 

Xj > fij + sj a/ 2 log(nc n ) (7.91) 

and hence 

S 3 r -(\i-ntf/(2s?) 



Aj - Pj nc n ^/2 log(nc n ) 

Then, with £ ~ N(0, 1), we have 

F(\Zj\ > Xj, for some j > r) < ^ 2P > Xj ~ ^ j 



Now consider j <r. It follows from (|5.2(J|) that, for all large n, 



where A'- = Sj a/2 log (nc n ). The calculation then proceeds as before. 



(7.93a) 



< 3=r=^ 0- (7-93c) 

nc n a/ 2 log (nc n ) 



> (7.94) 

for some c > 0. Hence, 



P^A,) = P (^_^<^_^) (7.95a) 



P ( £ > ) (7.95b) 



< P ( £ > — — ^ ) (7.95c) 
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